DOCUMENT RESUME

ED 135 840                    95                    TM 006 067

AUTHOR          Kosecoff, Jacqueline; And Others
TITLE           A System for Describing and Evaluating
                Criterion-Referenced Tests.
INSTITUTION     ERIC Clearinghouse on Tests, Measurement, and
                Evaluation, Princeton, N.J.
SPONS AGENCY    National Inst. of Education (DHEW), Washington,
                D.C.
REPORT NO       ERIC-TM-57
PUB DATE        Dec 76
CONTRACT        400-35-0015
NOTE            ^5p.
EDRS PRICE      MF-$0.83 HC-$1.67 Plus Postage.
DESCRIPTORS     *Criterion Referenced Tests; *Evaluation; *Evaluation
                Criteria; *Evaluation Methods; *Rating Scales
IDENTIFIERS     Criterion Referenced Test Description Evaluation



ABSTRACT

There are, at present, a number of tests that are
labeled criterion referenced. These tests vary considerably in
format, design, analysis, and function. In order to provide an
efficient and objective procedure for describing, assessing, and
comparing these measures, the Criterion Referenced Test Description
and Evaluation (CRTDE) rating system was developed. This system
incorporates general concern for the overall characteristics and
usability of a CRT and a specific concern for the technical
excellence with which the CRT was developed and analyzed. The CRTDE
rating form is divided into eight parts. The first three pertain to
overall CRT characteristics and usability: (1) marketing and
packaging, (2) examinee appropriateness, and (3) administrative
usability. The second five pertain to CRT technical excellence: (4)
function and purpose, (5) objectives development, (6) item
development, (7) methods of score interpretation, and (8) analysis
and validation. In addition to an explanation of the rating system,
this paper includes detailed instructions so that it can be used in a
standardized and accurate way by school personnel, test-selection
committees, researchers, and professional evaluators. (Author/BC)



* Documents acquired by ERIC include many informal unpublished         *
* materials not available from other sources. ERIC makes every effort  *
* to obtain the best copy available. Nevertheless, items of marginal   *
* reproducibility are often encountered and this affects the quality   *
* of the microfiche and hardcopy reproductions ERIC makes available    *
* via the ERIC Document Reproduction Service (EDRS). EDRS is not       *
* responsible for the quality of the original document. Reproductions  *
* supplied by EDRS are the best that can be made from the original.    *






ERIC CLEARINGHOUSE ON TESTS, MEASUREMENT, & EVALUATION

EDUCATIONAL TESTING SERVICE, PRINCETON, NEW JERSEY 08540



TM REPORT 57 



DECEMBER 1976 



A SYSTEM FOR DESCRIBING AND EVALUATING CRITERION-REFERENCED TESTS*
Jacqueline Kosecoff, Arlene Fink, & Stephen P. Klein

ABSTRACT 

There are, at present, a number of tests that are labeled criterion-referenced. These tests
vary considerably in format, design, analysis, and function. In order to provide an
efficient and objective procedure for describing, assessing, and comparing these measures,
the Criterion-Referenced Test Description and Evaluation (CRTDE) rating system was
developed. This system incorporates a general concern for the overall characteristics
and usability of a CRT and a specific concern for the technical excellence with which the
CRT was developed and analyzed.
The CRTDE rating form is divided into eight parts that reflect these concerns:



Overall CRT Characteristics and Usability
1. Marketing and Packaging
2. Examinee Appropriateness
3. Administrative Usability

CRT Technical Excellence
4. Function and Purpose
5. Objectives Development
6. Item Development
7. Methods of Score Interpretation
8. Analysis and Validation






In addition to an explanation of the rating system, this paper includes detailed
instructions so that it can be used in a standardized and accurate way by school personnel,
test-selection committees, researchers, and professional evaluators.



INTRODUCTION 






Public education has been submitted to much scrutiny
during the past decade, and the curriculum, instruction,
and techniques for evaluating students and educational
programs have been the focus of considerable debate.
Among the subjects still being fervently discussed is the
need to identify appropriate ways of measuring and
describing how much and how successfully students learn in
school. To many individuals concerned with testing and
measurement, traditional methods do not seem to be
adequate to meet this need. These people believe that although

*This work was begun at the Center for the Study of Evaluation in the
Graduate School of Education, UCLA.



existing measures are useful for several extremely
important educational purposes, like predicting who will
succeed in college, they are not necessarily appropriate for
many others, like describing what students have learned
in school. It is within the context of an increasing demand
for instructionally sensitive measures that the move
toward criterion-referenced tests (CRTs) has gained
momentum. Unfortunately, in their haste to develop and use
CRTs, few people have paused to consider the properties of
CRTs or to systematically evaluate the merit of existing
ones. This paper attempts to make up for this omission by
providing a system for describing and evaluating CRTs.



The material in this publication was prepared pursuant to a contract with the National Institute of Education, U.S. Department of Health, Education, and
Welfare. Contractors undertaking such projects under Government sponsorship are encouraged to express freely their judgment in professional and
technical matters. Prior to publication, the manuscript was submitted to qualified professionals for critical review and determination of professional competence.
This publication has met such standards. Points of view or opinions, however, do not necessarily represent the official view or opinions of either these
reviewers or the National Institute of Education.



CRITERION-REFERENCED MEASUREMENT: A DEFINITION



A criterion-referenced test is designed to provide a
measure of the extent to which an instructional task or skill
has been achieved. Three of the definitions most often used
are:

1. "A criterion-referenced test is one that is deliberately
   constructed to yield measurements that are directly
   interpretable in terms of specified performance
   standards. . . . Performance standards are generally
   specified by defining a class or domain of tasks that
   should be performed by the individual" (7).

2. "A pure criterion-referenced test is one consisting of
   a sample of production tasks drawn from a
   well-defined population of performances, a sample that
   may be used to estimate the proportion of
   performances in that population at which the student can
   succeed" (9).

3. "Criterion-referenced measures are those which are
   used to ascertain an individual's status with respect
   to some criterion, i.e., a performance standard" (18).



All CRTs have several features in common:

1. They are based on clearly defined educational tasks
   or objectives.

2. The test items are specifically designed to measure
   performance on these tasks or objectives.

3. Scores are interpreted in terms of attainment of a
   preset criterion or level of competence with respect to the
   educational tasks or objectives.



THE CRITERION-REFERENCED TEST DESCRIPTION AND EVALUATION RATING SYSTEM



There are, at present, a number of tests or test systems
labeled criterion-referenced, and these differ considerably
in format, design, and function. Some CRTs, for example,
consist of many small tests and are intended for classroom
uses, like the diagnosis and placement of students, while
others contain only one or a few tests and are designed for
evaluation purposes, like establishing the effectiveness of
an educational program. In order to provide an efficient
and objective procedure for describing, assessing, and
comparing all CRTs, the Criterion-Referenced Test Description
and Evaluation (CRTDE) rating system was developed.
A complete set of instructions to accompany the system
was also prepared so that the CRTDE could be applied in a
standardized way by school and district-level personnel,
test-selection committees, researchers interested in CRTs
for their own work, and professional evaluators.

The CRTDE system incorporates two areas of concern.
First, it is concerned with the overall characteristics and
usability of a CRT, including a description of a CRT's
marketing and packaging features, its administrative
usability, and examinee appropriateness.* Second, the CRTDE
system is concerned with the specific attributes that
constitute the technical quality of the test development
and analysis process, including the function and purpose
of a CRT, the generation of objectives and items, schemes
for interpreting CRT scores, and the analysis and validation
of items and tests.

It should be noted that, depending upon the uses to
which a CRT is put, certain items in the system will assume
more or less importance. Ideally, these items would have



*The development of the Criterion-Referenced Test Description and
Evaluation form was guided by the MEAN test evaluation procedure (12).
MEAN is an acronym reflecting four critical areas of interest to test users:
measurement validity, examinee appropriateness, administrative usability,
and normed technical excellence. The categories of administrative
usability and examinee appropriateness in the CRTDE have been particu-



been identified and weighted accordingly. Unfortunately,
it was not possible to develop a single set of weights that
would be appropriate for all CRT uses. For example, the
number of test forms is always an important factor in
determining the usability of a CRT; however, for classroom
purposes, it is particularly desirable that the CRT consist of
several short tests that can be administered throughout
the year (and a high weight should be assigned to one that
has this property), while for evaluation purposes, it is
particularly desirable that the CRT consist of a single,
comprehensive test form (and a high weight should be
assigned to a CRT with this feature). In view of this, two
illustrative sets of weights were developed for a subset of
the CRTDE items that were thought to represent definitive
properties of a CRT and to be crucial factors in establishing
a CRT's usability. The sets of weights were patterned
after two typical, but different, CRT uses: (1) as a classroom
resource and (2) as a tool for evaluations involving two or
more instructional programs.
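
The idea of scoring the same ratings under two different weight sets can be sketched as follows. This is only an illustration: the item names, the 0-2 rating scale, and every weight value are hypothetical and are not taken from the CRTDE form.

```python
# Hypothetical ratings of one CRT on three illustrative items
# (0 = absent, 1 = partly present, 2 = fully present).
ratings = {
    "several_short_test_forms": 2,
    "single_comprehensive_form": 0,
    "objectives_clearly_stated": 2,
}

# Hypothetical weight sets, one per intended use of the CRT.
weights_classroom = {
    "several_short_test_forms": 3,
    "single_comprehensive_form": 1,
    "objectives_clearly_stated": 2,
}
weights_evaluation = {
    "several_short_test_forms": 1,
    "single_comprehensive_form": 3,
    "objectives_clearly_stated": 2,
}


def weighted_score(item_ratings, item_weights):
    """Sum of each item's rating multiplied by its weight."""
    return sum(item_weights[item] * rating for item, rating in item_ratings.items())


classroom_total = weighted_score(ratings, weights_classroom)
evaluation_total = weighted_score(ratings, weights_evaluation)
print(classroom_total, evaluation_total)
```

Under these made-up numbers the same test rates higher as a classroom resource than as an evaluation tool, which is the kind of use-dependent difference the two weight sets are meant to capture.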

In developing the CRTDE rating form, the two areas of
concern (overall characteristics and technical quality)
were organized into eight components that are relevant to
the description and evaluation of CRTs.

Components of the CRTDE Rating Form

Overall Characteristics     1. Marketing and Packaging
and Usability               2. Examinee Appropriateness
                            3. Administrative Usability

Technical Quality           4. Function and Purpose
                            5. Objectives Development
                            6. Item Development
                            7. Methods of Score Interpretation
                            8. Analysis and Validation






matics, in order to identify all knowledge and skills
   that must be acquired if the area is to be learned (10).

4. Theories of learning and instruction. A literature
   review is conducted and/or consultants called in to
   formulate series or hierarchies of educational tasks
   and purposes based upon the results of psychological
   theory and research (20).

5. Empirical studies. Experiments are conducted in
   order to identify the objectives that are most
   important because the skills and knowledge are
   inherently essential.

No matter how they are derived, educational tasks and
purposes are usually called objectives or behavioral
objectives. However, it should be noted that these terms have a
precise meaning to educators: "An objective is an intent
[author's italics] communicated by a statement describing
a proposed change in a learner--a statement of what the
learner is to be like when he has successfully completed a
learning experience" (25). Developers of CRTs do not
always use this definition in its purest sense. To them, an
objective refers to the content that is supposed to have
been learned (for example, equivalent and nonequivalent
sets in sixth-grade math) and only sometimes includes the
behaviors the student is supposed to exhibit (such as
naming the first five presidents of the USA).

There are several issues involved in the formulation and
generation of objectives. An important one relates to the
rules needed for writing objectives and how broadly or
narrowly they should be stated. Formal rules for generating
and stating objectives are needed to ensure the uniformity,
manageability, and comprehensiveness of the set of
objectives or domain that the CRT measures.* Still another issue
deals with how a domain is organized. The objectives for a
single domain can be grouped by grade levels; they can be
organized according to major content areas; and/or they
can be arranged into a hierarchy according to the
complexity of the behaviors involved or the order of instruction.

Formulating and generating items. Once the objectives
for the CRT have been chosen, the next step is to construct
and/or select test items to measure the objectives. This is
one of the most difficult steps in the total developmental
process because of the vast number of test items that
might be constructed for any given objective, even those
that are relatively narrowly defined (24). For example,
consider the following objective: "The student can
compute the correct product of two single-digit numbers
greater than zero where the maximum value of this product
does not exceed fifteen." The specificity of this objective is
quite deceptive since there are 32 pairs of numbers that
meet this requirement and at least 10 different test item
types that might be used to assess student performance,
as shown in Figure 1.

Further, each of the resulting 320 combinations of pairs
and item types could be modified in a variety of ways that
might influence whether they were answered correctly.



*The set of objectives that a CRT measures is sometimes called a domain or
universe of content (30, 5). However, the term domain is used by others
to mean the rules for generating test items to measure a specific
objective (15).



Some of these modifications are:

• use different item formats (multiple choice versus
  completion)

• change the mode of presentation (written versus oral)

• change the mode of response (written versus oral)



FIGURE 1

TYPES OF CRT TEST ITEMS USING THE
NUMBERS 3 AND 5

a.    5
    x 3
    ___

b. 5 x 3 = ___

c. (5) (3) = ___

d. 5 · 3 = ___

e. 5 times 3 = ___

f. The product of 5 and 3 = ___

g. 5 x ___ = 15

h. If x = 5 and y = 3, what is the value of xy?

i. What number multiplied by 3 will equal 15?

j. John has 5 apples. Sally has 3 times as many apples as
   John. How many apples does Sally have?



It soon becomes evident that a highly specific objective
can have a potential item pool of well over several thousand
items (14, 15, 2).
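
The size of the item pool for the multiplication objective can be checked by direct enumeration. The sketch below assumes that the order of the two factors matters (so 3 x 5 and 5 x 3 count separately) and that both factors run from 1 to 9; the count of 10 item types is taken from Figure 1. Under these assumptions the enumeration gives 33 ordered pairs, one more than the 32 quoted in the text, which may reflect a counting convention the text does not state.

```python
from itertools import product

MAX_PRODUCT = 15      # "does not exceed fifteen"
ITEM_TYPES = 10       # item types a through j in Figure 1

# Ordered pairs of single-digit numbers greater than zero whose
# product does not exceed the maximum (assumption: order matters).
pairs = [
    (a, b)
    for a, b in product(range(1, 10), repeat=2)
    if a * b <= MAX_PRODUCT
]

n_pairs = len(pairs)
n_combinations = n_pairs * ITEM_TYPES
print(n_pairs, n_combinations)
```

Each further modification (item format, mode of presentation, mode of response) multiplies this count again, which is how a narrowly stated objective reaches a pool of several thousand items.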

The number of items to construct for each objective is
influenced by several factors, including the amount of
testing time available and the cost of making an interpretation
error, such as saying that a student has not achieved mastery
when he or she has and therefore erroneously
concluding that a student should not participate in a college
preparatory program. For some objectives many items are
needed in order to obtain a stable estimate of a learner's
performance, whereas for other objectives fewer items will
suffice.
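
One way to see how the number of items affects the stability of a score is to look at the standard error of a proportion-correct score. The sketch below assumes a simple binomial model (items sampled independently from the objective's item pool); this model is an assumption made here for illustration and is not prescribed by the text.

```python
import math


def standard_error(true_proportion, n_items):
    """Binomial standard error of a proportion-correct score."""
    return math.sqrt(true_proportion * (1 - true_proportion) / n_items)


# How the uncertainty shrinks as more items are tested for a learner
# whose true proportion correct is assumed to be 0.7.
for n_items in (5, 10, 20, 40):
    print(n_items, round(standard_error(0.7, n_items), 3))
```

Because the standard error falls only with the square root of the number of items, halving the uncertainty requires roughly four times as many items, which has to be weighed against the available testing time.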

A related issue in the construction and generation of
CRT items is the degree to which the items should be
sampled with respect to their relative difficulty and
possible content coverage within an objective. It is a
well-known and frequently used principle of test construction
that even slight changes in an item can affect its difficulty.
The extent to which the items within an objective are
sampled with respect to difficulty has a direct bearing on
the interpretation of the scores obtained. In other words, if
only the most difficult items are used, the phrase
"achievement of the objective" has a very different meaning than if
the items are sampled over the full range of difficulties.

Another issue concerns a CRT's curriculum match, the
extent to which a CRT is designed for use with a specific
educational program (1, 29). CRTs with a greater degree of
curriculum match have objectives and test items that are
associated with a particular curriculum or set of
educational materials and techniques. CRTs with a smaller degree
of curriculum match, on the other hand, contain objectives
and test items that are not necessarily associated with the
specific skills or content of an educational program.
However, such CRTs still may have been developed from several
educational programs and consequently have objectives
and items that reflect the bias inherent in these programs.
Conversely, CRTs with no curriculum match are based on a
domain of content and behaviors that is independent of
any educational program and, therefore, can be used to
compare several different educational programs.

Consideration of the various issues involved in item
generation for CRTs has produced a number of different
strategies for generating and constructing items. These
include assembling a panel of experts (13), using
content/process matrices, and applying formal item generation
rules (14, 15, 5, 27).

Formulating score interpretation schemes. The uniquely
distinctive feature of a CRT is its ability to provide a means
for describing what an individual (or group) can do, know,
or feel without having to consider the skills, knowledge, or
attitudes of others. Consequently, CRT scores are reported
and interpreted in terms of the level of performance
obtained with respect to the objective(s) or domain on which
the CRT is based. This type of score reporting is very
different from that used for norm-referenced tests, in which
scores are reported in terms of the performance of other
individuals or groups.

1. Actual score. The number or percent of items correct
   on a given objective, referring to the number of items
   actually passed on the test.

2. True score. An individual's or group's true level of
   performance on an objective, referring to the portion
   of the total universe of items for an objective that an
   individual or group could answer correctly.

3. Mastery of a given objective. The achievement of a
   preset criterion level of performance is called mastery
   of an objective. Criterion levels can be selected
   arbitrarily or can be justified using experts' judgments
   and/or the results of empirical studies.

4. Performance time. The time it takes, in class hours or
   calendar days, for a student to achieve a given
   performance level.

5. Level readiness. A score that reflects the probability
   that the student is ready to begin the next level of
   instruction (this may be based on both the number of
   items correct and the pattern of answers given to
   these items).

6. Total individuals who passed. The number of
   individuals or groups who passed or mastered each
   objective or item. (This score is given most often for
   individual items when only one item is tested per
   objective, for example, National Assessment of
   Educational Progress.)

7. Total objectives mastered. The number of objectives
   passed or mastered by an individual or group.
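
Three of the score types listed above (actual score, mastery of an objective, and total objectives mastered) can be computed directly from item responses. The following is a minimal sketch; the response data, the objective names, and the 80 percent criterion level are all hypothetical.

```python
# Hypothetical item responses for one student:
# 1 = item answered correctly, 0 = item answered incorrectly.
responses = {
    "objective_1": [1, 1, 1, 0, 1],
    "objective_2": [1, 0, 0, 1, 0],
    "objective_3": [1, 1, 1, 1, 1],
}

MASTERY_CRITERION = 0.8  # hypothetical preset criterion level


def actual_score(item_scores):
    """Number of items passed on an objective."""
    return sum(item_scores)


def percent_correct(item_scores):
    """Proportion of items passed on an objective."""
    return sum(item_scores) / len(item_scores)


def has_mastered(item_scores, criterion=MASTERY_CRITERION):
    """Mastery means reaching the preset criterion level."""
    return percent_correct(item_scores) >= criterion


mastered = [name for name, scores in responses.items() if has_mastered(scores)]
total_objectives_mastered = len(mastered)
print(mastered, total_objectives_mastered)
```

The true score, by contrast, cannot be computed from a single test form; the percent correct on the sampled items only estimates it.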



It should be noted that scores on CRT tests need not be
limited to just a CRT interpretation. Other score
interpretations can also be provided to expand upon the CRT
interpretation (21, 4, 8). For example, one might say that "This
school had an average score of 5 out of 10 on the objective
(a CRT interpretation), which is one standard deviation
below the national average of 7 out of 10 (a
norm-referenced interpretation)." The notion of using both types of
score interpretations does not reduce the theoretical
soundness of the CRT interpretation (21, 23).
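
The two interpretations in the example above can be produced from the same score. In this sketch the national standard deviation of 2 is not stated in the text; it is the value implied by saying that 5 is one standard deviation below a mean of 7. The function names are illustrative only.

```python
def crt_interpretation(score, n_items):
    """Describe performance relative to the objective itself."""
    return f"{score} out of {n_items} items correct on the objective"


def standard_score(score, norm_mean, norm_sd):
    """Describe performance relative to a norm group (z-score)."""
    return (score - norm_mean) / norm_sd


school_average = 5
n_items = 10
national_mean = 7
national_sd = 2  # implied by the example, not given directly

print(crt_interpretation(school_average, n_items))
print(standard_score(school_average, national_mean, national_sd))
```

The first line of output needs no information about other schools; the second cannot be computed without it.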



Validation of CRTs

When construction of the objectives and test items is
complete, the CRT must be analyzed and validated. This process
can involve giving the test to students and studying their
responses (response data) or relying upon review by
experts (judgmental data).

There is much ambiguity about the procedures for
analyzing and validating CRTs. Nevertheless, there are
several dimensions of test and item quality that are
considered to be relevant to CRT validation and that have
associated with them review procedures, data-collection
strategies, experimental designs, and statistical indexes.

Establishing item quality. Following are several
commonly considered dimensions of item quality:

1. Item-objective congruence. A test item is considered
   good if it measures or is congruent with the objective
   that it is supposed to assess. Item-objective
   congruence can be established by using judgmental data.
   Typically, content experts are given a variety of
   objectives and the items used to measure them and are
   asked to assign the items to their appropriate
   objective or to comment on the appropriateness of the
   item-objective relationship.

2. Equivalence (internal consistency within objectives).
   A test item is considered good if it behaves like other
   items measuring the same objective. The concept is
   similar to item-objective congruence, but its proper
   use depends on response data. Equivalence is usually
   measured by computing the biserial correlation
   between the score on an item and the total score on all
   items measuring that objective.

3. Stability (over time). An item is considered good if
   examinee performance is consistent from one test
   period to the next in the absence of any special
   intervention (such as instruction, which is an intervention
   that can change examinee performance). Stability
   involves response data and can be measured by using a
   phi coefficient that correlates scores on the item from
   two different occasions as long as too much time does
   not elapse between them.

4. Sensitivity to instruction. An item is considered good
   if it is sensitive to instruction, that is, if there is a
   discrimination in responses to the item between those
   who have and those who have not benefited from
   instruction. This measure of item quality is usually
   computed for CRTs that are linked to particular
   educational programs, and it requires response data.
   Examinees are tested before and after an educational
   program, and items that many examinees fail before
   instruction but pass after it are considered to be
   sensitive to the instruction.
5. Cultural/sex bias. An item is considered good if it
   does not lead to inaccurate conclusions about the
   performance of different cultural groups or sexes. Bias
   can be assessed using either judgmental or response
   data. If the former are used, representatives of
   different cultural groups, members of each sex, and/or
   linguists examine test items to determine whether
   vocabulary or content are foreign or could be
   misinterpreted. If response data are used to assess bias,
   they are analyzed, typically using analysis of variance
   or regression techniques, for item-cultural/sex
   interactions.
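
Two of the response-data indexes named above can be sketched for a single item scored 1 (pass) or 0 (fail). The phi coefficient follows the standard formula for a 2 x 2 table. The sensitivity index shown here, the post-instruction pass rate minus the pre-instruction pass rate, is one common choice and is an assumption of this sketch rather than a formula given in the text. All response data are hypothetical.

```python
import math


def phi_coefficient(first, second):
    """Phi coefficient between two sets of 0/1 scores on the same item."""
    a = sum(1 for x, y in zip(first, second) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(first, second) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(first, second) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(first, second) if x == 0 and y == 0)
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return None  # undefined when everyone passes or everyone fails
    return (a * d - b * c) / denominator


def sensitivity_index(pre, post):
    """Pass rate after instruction minus pass rate before it."""
    return sum(post) / len(post) - sum(pre) / len(pre)


# Stability: the same eight examinees, two occasions, no instruction between.
occasion_1 = [1, 1, 0, 0, 1, 0, 1, 1]
occasion_2 = [1, 1, 0, 0, 1, 0, 0, 1]
stability = phi_coefficient(occasion_1, occasion_2)

# Sensitivity: the same eight examinees before and after a program.
pretest = [0, 0, 1, 0, 0, 0, 0, 1]
posttest = [1, 1, 1, 0, 1, 1, 1, 1]
sensitivity = sensitivity_index(pretest, posttest)

print(stability, sensitivity)
```

A stable item should show a phi coefficient near 1 when no instruction intervenes, while a sensitive item should show a clearly positive difference between pretest and posttest pass rates.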

Establishing test quality. There are six dimensions
commonly used to express the quality of a CRT:

1. Test-objective congruence. Similar to item-objective
   congruence, test-objective congruence is an index of
   the extent to which the total test or subtest measures
   the stated objectives. Test-objective congruence is
   usually determined by using judgmental data.

2. Equivalence (internal consistency). Test equivalence
   is a measure of the homogeneity of test items for an
   objective; that is, how coherently the test items
   assess the particular objective. This can be measured
   by using split-half correlations, Kuder-Richardson
   formulas, or coefficient alpha.

3a. Stability (test-retest, or alternate forms). A test is
   stable to the extent that examinee responses are
   consistent from one test period to another or across
   alternate forms of a test in the absence of any
   intervention. Stability is usually measured by using
   correlation techniques.

3b. Stability (number of items per objective and number
   of objectives per domain). There are two levels at
   which this type of stability for a CRT can be estimated.
   At the first level, a determination is made of the
   number of items that should be tested in order to obtain a
   stable score on an objective. For this type of stability,
   the assumption is made that for each objective there
   is a pool or population of items with mixed difficulties
   that measure the objective and that for any given test
   a sample of those items is selected. At the second
   level, a determination is made of the number of
   objectives that should be tested in order to obtain a stable
   estimate of performance on the domain. For this type
   of stability, the assumption is made that a single
   score is needed that describes an individual's
   performance on the domain or set of objectives. Stability
   can be estimated with response data using correlation
   techniques and/or Bayesian statistics (26).

4. Sensitivity to instruction. A test's ability to
   discriminate between those who have and those who have not
   benefited from instruction. This measure of test
   quality is usually obtained for CRTs that are linked to a
   specific educational program and is obtained using
   response data.



5. Cultural/sex bias. Bias occurs when a test leads to
   inaccurate conclusions about the performance of
   cultural/sex groups. It can be estimated with analysis of
   variance or regression techniques using response data,
   or by expert review using judgmental data.

6. Criterion validity. Criterion validity establishes the
   meaningfulness of the criterion in terms of which CRT
   scores are interpreted. Establishing criterion validity
   is either a one-step or a two-step process. The first
   step involves assessing the meaningfulness of the domain:
   that objectives have been selected and organized to be
   in themselves educationally significant and that test
   items have been systematically generated to cover the
   objectives. Step 1 criterion validity is usually
   established by having experts review the objectives and
   test items to determine the extent to which they were
   developed in conformance with prespecified
   procedures and the extent to which they cover the domain
   in a comprehensive and meaningful manner.

Step 1 must be completed for all CRTs and, in some
cases, is sufficient for establishing criterion validity. One
example is a CRT that is based on objectives that are
narrowly defined and operationally stated in such detail that
generating test items only requires transposing them into
question form. CRT score interpretations for objectives
with these characteristics are meaningful because the
objectives describe skills that can be measured directly by
test items. A second example is when the CRT's objectives
are linked to a curriculum and its scores are intended for
and interpreted by teachers and curriculum experts. CRT
score interpretations in terms of these types of objectives
are meaningful because the skills and knowledge being
measured are those taught in classrooms where a specific
curriculum is taught.

The second step is established through empirical means
and involves determining whether examinees who perform
well on the test have really achieved the educational
objective. Step 2 criterion validity can be measured by
comparing scores obtained on a CRT by individuals who, in
advance of taking the CRT and using independent criteria,
were judged to possess or not possess the skills that the
objective is intended to measure. To the extent that the
CRT discriminates between these two groups of individuals,
the CRT has criterion validity.*
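
The comparison described for Step 2 can be sketched with two groups judged in advance to possess or not possess the skill. The data, the cutoff of 7 out of 10, and the two summary figures used here (agreement between judgment and mastery decision, and the difference in group means) are all assumptions chosen for illustration; the text does not prescribe a particular index.

```python
# Hypothetical data for eight examinees.
judged_skilled = [1, 1, 1, 1, 0, 0, 0, 0]   # independent judgment made in advance
crt_scores = [9, 8, 10, 6, 3, 5, 7, 2]      # items correct out of 10

CUTOFF = 7  # hypothetical criterion level for mastery

# Mastery decision implied by the CRT score.
crt_mastery = [1 if score >= CUTOFF else 0 for score in crt_scores]

# Proportion of examinees for whom the CRT decision matches the judgment.
agreement = sum(
    1 for judged, decided in zip(judged_skilled, crt_mastery) if judged == decided
) / len(crt_scores)


def group_mean(scores, membership, group):
    """Mean score of the examinees belonging to one judged group."""
    selected = [s for s, m in zip(scores, membership) if m == group]
    return sum(selected) / len(selected)


mean_skilled = group_mean(crt_scores, judged_skilled, 1)
mean_unskilled = group_mean(crt_scores, judged_skilled, 0)
print(agreement, mean_skilled, mean_unskilled)
```

The larger the separation between the two groups, and the higher the agreement, the stronger the evidence that passing the CRT reflects possession of the skill.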

By establishing Step 2 criterion validity, the
relationship between test items and the objectives they are
supposed to measure is empirically confirmed. Step 2 criterion
validity permits assertions about mastery of the individual
objectives that comprise a domain and about more complex
behaviors whose component parts are defined by the
domain. Step 2 criterion validity is particularly useful
when it may be difficult to automatically assume that
achievement of the items necessarily reflects achievement
of the larger objective or domain.
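
Of the internal-consistency indexes named under test equivalence above, Kuder-Richardson formula 20 is the simplest to compute for items scored pass/fail. The sketch below uses the standard KR-20 formula with the population variance of total scores; the response matrix is hypothetical.

```python
from statistics import pvariance


def kr20(score_matrix):
    """Kuder-Richardson formula 20 for 0/1 item scores.

    score_matrix holds one row per examinee and one column per item.
    Returns None when total scores do not vary.
    """
    n_examinees = len(score_matrix)
    n_items = len(score_matrix[0])
    total_variance = pvariance([sum(row) for row in score_matrix])
    if total_variance == 0:
        return None
    pq_sum = 0.0
    for item in range(n_items):
        p = sum(row[item] for row in score_matrix) / n_examinees
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical responses of five examinees to four items on one objective.
score_matrix = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(kr20(score_matrix))
```

The function returns None when every examinee has the same total score, a case that matters for criterion-referenced tests given before or after effective instruction.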

The question of classical reliability and validity. There
has been considerable debate over the appropriateness of


*Step 2 criterion validity is similar to construct validity and/or
discriminant validity, but an objective or a domain, rather than a psychological
state, is the construct.



"classical" (long and widely used) indexes of reliability
and validity for criterion-referenced tests. Some
psychometricians have argued that since CRT items are selected
to measure achievement of specific educational objectives
and not to discriminate among students, scores on CRTs
can lack variation. This could arise in the following
situation: Before instruction, none of the students have
mastered the objectives, and all might receive a score of
zero on the criterion-referenced pretest. After instruction,
however, all might receive very high scores on the
criterion-referenced posttest. A lack of variation in student
scores, it is claimed, would cause the traditional indexes of
reliability and validity (that are based on variance) to be
inappropriate (28).
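
The variance problem can be made concrete with a test-retest correlation. In this sketch a Pearson correlation is computed by hand so that the zero-variance case is visible; the score lists are hypothetical, and the second pair of lists is included only as a contrast in which scores do vary.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation; None when either set of scores has no variance."""
    var_x, var_y = pvariance(x), pvariance(y)
    if var_x == 0 or var_y == 0:
        return None  # the index is undefined without score variation
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    return covariance / (var_x * var_y) ** 0.5


# Before instruction: every student scores zero on both occasions.
pretest_first = [0, 0, 0, 0, 0]
pretest_second = [0, 0, 0, 0, 0]
no_variation = pearson_r(pretest_first, pretest_second)

# Contrast: students at mixed levels of competence, tested twice.
mixed_first = [2, 5, 7, 9, 10]
mixed_second = [3, 5, 6, 9, 10]
with_variation = pearson_r(mixed_first, mixed_second)

print(no_variation, with_variation)
```

With identical scores the denominator of the correlation is zero, so the index cannot be computed at all; once scores vary, the same formula behaves normally.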

Others have argued that when CRTs are administered to
a heterogeneous sample representing differing degrees of
competence and receiving differing instruction on the
objective, there will be sufficient variation in test
performance to apply the classical statistical formulas (21,
12). This latter stance is becoming the accepted view, and
it is now held that the classical indexes (such as stability,
equivalence) can be estimated for CRTs using a
heterogeneous population.

Theoretical Value of CRTs for Evaluations

Based on theoretical considerations alone, are CRTs
appropriate to measure achievement for large-scale effectiveness
evaluations?

An effectiveness evaluation requires instruments that
are reliable and valid and that provide meaningful scores
that can be used to make decisions about educational
policy. In theory, there is an orderly set of developmental
and validation procedures which, if followed properly,
produce CRTs that are based on well-defined sets of objectives
and that can provide meaningful and useful score
interpretations. Thus, from a theoretical perspective, CRTs are
appropriate and desirable for measuring achievement in
effectiveness evaluations.

Generating Review Criteria

Currently available CRTs were reviewed to determine if
they were technically sound and if they could be easily
used for a large-scale effectiveness evaluation. To
structure the review, a set of criteria were generated. The
criteria reflect the characteristics generally accepted as being
necessary and appropriate for a large-scale effectiveness
evaluation. Consequently, many of them could be applied
to norm-referenced tests as well as CRTs. In order to obtain
the criteria, several sources were consulted, including a
review of the literature, requests for proposals issued by
state and federal agencies involving large-scale
evaluations, and criteria already developed and used for
reviewing achievement tests.

Obtaining CRTs

A list of publishers of educational tests was compiled using
test review books (3, 16, 17, 18), personal contacts, and
library sources (24). It should be noted that publishers on
the list were not necessarily known as marketers of CRTs
because it was not always possible to predict in advance
who published CRTs and who did not and because it was
considered important to include as many publishers as
possible in the review.

A letter was sent to each publisher requesting
information about: any criterion-referenced math or reading tests
they might have available (including detailed descriptions
of the test battery at each available grade level); sample
tests for reading and math at each available grade level;
lists of objectives or domains for reading and math at each
available grade level; directions for administering and
scoring reading and math tests at each available grade
level; all technical manuals, field test reports, expert
reviews, or test-analysis information; information about
special features like scoring services or cassette-recorded
directions; and cost information.

From publishers' responses* 28 crts were obtained thai 
had sufficient information for review purposes. Each crt 
was independently reviewed twice using the set of criteria 
generated for this purpose* and discrepancies were resolved 
by both reviewers. Any remaining questions, usually re- 
sulting from unclear or insufficient information from the 
publishers* were followed up with a pjione call* 

Explanation of Review Criteria 

There were 19 criteria against which crts were reviewed. 
(Figure 2 shows the form used by reviewers-) For this re* 
view, reading and language arts were considered to be one 
subject area and mathematics another. All subtest-s or 
tests of individual objectives at the same level were 
grouped together and considered as a single reading or 
math test. In addition* thecriteria were especially designed 
to permit the cross-grade level and longitudinal compari- 
sons that typify large-scale evaluations. 

1. Coverage of specific skills. A test had to cover basic 
skills in reading (language arts) or mathematics. 

2. Grade-level coverage- Forms of the test had to be 
available for grades 1 to 9 in order to make possible 
comparisons across grade levels as well as longitu* 
dinat comparisons, (High school-level crts were ex* 
eluded because so few publishers had them available.) 

3 Overlap of objectives across grade levels. Some or all 

7 



REVIEW OF CURRENTLY AVAILABLE CHTS 



of Lhe ttjsi's ub}oitivt;a hiui In* iiif^usurcil at each 
gradt> lev elm onler topt^rtnit comparisons of commun 
educational objectives across grade Itjveis or over 
time. 

4. Number of test forms per grade level. Due to constraints
   related to test administration and the time available for
   testing, there had to be a limited number of test forms at
   each grade level. Just one test per grade level was
   preferred in order to avoid problems with reliability that
   could arise when several test forms are combined.

5. Directions for test administration. A test had to provide
   thorough and clear instructions for both the examiner and
   examinee. Directions concerning distributing tests,
   demonstrating sample questions, and test administration
   had to be provided in a detailed and easy-to-read form.

6. Special equipment for test administration. Because of the
   logistics and costs involved in large-scale information
   collection, test administration could not involve any
   special equipment (like cassettes or visual aids) aside
   from pencil and scratch paper.

7. Time for testing. A test had to be designed to be
   completed within a given class period, the amount of time
   usually available to outside evaluators.

8. Group testing. A test had to be designed for group
   administration, since individual administration is
   prohibitive in large-scale evaluations.

9. Item-objective match. Each test item had to be coded to
   an objective (or the educational tasks and purposes the
   test claimed to measure).

10. Objective coverage. There had to be a sufficient number
    of items to adequately measure each objective.

11. Objective/subjective scoring. A test had to use an
    objective scoring procedure since it would be very costly
    to train individuals to use subjective scoring schemes.

12. Machine scorable. The test had to be available in, or
    adaptable to, machine scoring.

13. Score interpretation scheme. A test had to employ a
    criterion-referenced score-interpretation scheme.

14. Reusable materials. To save money, reusable test booklets
    and test manuals were requested.

15. Curriculum match. A test could not be based on the
    objectives of any particular curriculum or educational
    program.

16. Costs of tests per pupil. The costs of testing pupils had
    to be kept low enough to accommodate a large-scale study.

17. Formal field test. A test had to provide documentation of
    field test activities. It was preferred that the field
    test participants be nationally and geographically
    representative, be a probability sample, and include
    sufficient numbers of minority persons to permit an
    estimation of bias.

18. Information on item quality. Information had to be
    provided, based either on judgmental or response data,
    about item stability, sensitivity to instruction,
    sex/cultural bias, item-objective congruence, and
    equivalence.

19. Information on test quality. Information had to be
    provided on test quality, based either on judgmental or
    response data, to include information about internal
    consistency, test stability, test-objective congruence,
    sex/cultural bias, sensitivity to instruction, and
    criterion validity.
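
A reviewer's ratings on the pass/fail criteria can be recorded as a simple mapping. The following is a hypothetical sketch, not the authors' procedure: the rating codes follow the key on the review form, but the function name, the rule that only "P" counts as meeting a criterion, and the sample ratings are all assumptions made for illustration. Criteria rated on the sometimes/always/never scale would need a criterion-specific rule and are left out.

```python
# Rating codes as given in the key on the review form.
RATING_CODES = {
    "P": "pass",
    "F": "fail",
    "S": "sometimes",
    "A": "always",
    "N": "never",
    "U": "unclear, not stated, or stated without documentation",
}

# Assumption: for pass/fail criteria, only "P" meets the criterion.
PASSING = {"P"}


def unmet_criteria(ratings):
    """Return the names of criteria not rated as passing, sorted."""
    for name, code in ratings.items():
        if code not in RATING_CODES:
            raise ValueError(f"unknown rating code {code!r} for {name!r}")
    return sorted(name for name, code in ratings.items()
                  if code not in PASSING)


# Invented ratings for one hypothetical test (4 of the 19 criteria).
sample = {
    "group testing": "P",
    "directions for test administration": "F",
    "machine scorable": "P",
    "score interpretation": "U",
}

print(unmet_criteria(sample))
```

Treating "U" as unmet mirrors the way the review counts tests that supplied no information separately from those that met a criterion.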

Results of the Review 

The 28 tests reviewed are listed in Table 1; the results of
the review are presented below.
It should be noted that because many of the 28 CRTs were
intended as classroom resources and not for
effectiveness-evaluation purposes, the review conducted for
this investigation tended to make some CRTs look less
excellent than they would have if they had been reviewed from
another perspective.

1. Coverage of specific skills. Of the 28 tests reviewed, 15
   were designed to assess reading skills, and 13 were
   designed to assess mathematics skills. All 28 tests
   reviewed focused on measuring basic skills in reading
   and/or mathematics and thus met the criterion.

2. Grade-level coverage. Nine tests were available for grades
   K-9, and thus met the criterion. The remainder varied from
   CRTs available for grades K-2 to those available for
   grades K-8.

3. Overlap of objectives. Twelve tests appeared to measure
   the same objectives at all grade levels. Sixteen tests
   appeared to have some overlapping objectives which were
   measured at most, but not all, grade levels. It should be
   noted that to make common objectives, test publishers
   frequently used broadly stated objectives or skill
   categories which they then translated into tasks or skills
   of varying complexity for different grade levels.

4. Number of test forms. Some CRTs had only one test form per
   grade level and others had as many as 31. Usually those
   CRTs that offered a limited number of test forms per grade
   level would include several objectives on a single test
   form, while those featuring more test forms per grade
   level would assess one or only a few objectives per form.
   Three tests did not set limits on the number of tests that
   could be created from their bank of objectives and items.

5. Directions for test administration. Twenty-seven of the
   tests met the criterion by providing adequate directions
   both to the examiner and examinee for test administration.
   One test contained no information about administration.

6. Special equipment required. Twenty-six tests required no
   special equipment for test administration and, therefore,
   met the criterion. Two tests required the use of tape
   recorders or cassettes, and one test provided no
   information. It should be pointed out that many of the 26
   tests that do not require special equipment are designed,
   nevertheless, for use with special equipment and consider
   its omission to be undesirable.

7. Time for testing. Only two tests met this criterion. More
   tests (24) left time for testing open, but from their
   length appeared to require more than one hour of testing
   time. One CRT had no information about the time needed for
   testing.

FIGURE 2

THE REVIEW FORM

Specifications for Selecting Tests              Reviewer's Notes      Criterion Rating/Range

Coverage of specific skills                                           P F U
   Tests must cover basic skills in
   language arts/mathematics

Grade-level coverage                                                  P F U
   grades 1 through 9

Overlap of objectives across grade levels                             S A N U
   Same objectives should be measured
   at each grade level

Number of test forms per grade level
   Should be a limited number of test forms

Directions for test administration                                    P F U
   Complete directions for test
   administration

Special equipment for test administration                             P F U

Time for testing

Group testing                                                         P F U

Item-objective match                                                  P F U
   Test items must be keyed to objectives
   that can be broadly stated

Objective coverage
   Test items should adequately cover
   each objective

Objective/subjective scoring                                          P F U

Machine scorable                                                      P F U

Score interpretation                                                  P F U
   Must be criterion referenced

Reusable materials                                                    P F U

Curriculum match                                                      S A N U
   Test cannot be based on specific
   curricula

Cost of test per pupil

Formal field test
   Preferably should have
   a) national scope
   b) geographic scope
   c) minority representativeness
   d) probability sampling

Item quality information
   Judgmental or response data:
   a) instructional sensitivity
   b) stability
   c) sex/cultural bias
   d) item-objective congruence

Test quality information
   Judgmental or response data:
   a) internal consistency
   b) stability
   c) test-objective congruence
   d) sex/cultural bias
   e) instructional sensitivity
   f) criterion validity

KEY: P - pass, F - fail, S - sometimes, A - always, N - never,
U - unclear, not stated, stated without supporting
documentation

8. Group testing. Twenty-five tests could be administered to
   groups and, therefore, met the criterion. Two tests were
   designed for individual administration only, and one did
   not provide this information.

9. Item-objective match. Twenty-six tests had each item coded
   to an objective, and one CRT did not provide this
   information.

10. Objective coverage. The number of items tested for each
    objective ranged from 1 to 150 across the 28 tests. (It
    should be noted, however, that the CRT with 150 items per
    objective was based on a computerized item bank from
    which tests of any length could be generated.)

11. Objective/subjective scoring. Twenty-seven tests employed
    an objective scoring technique, meeting this criterion.
    One test employed a subjective technique, and one other
    CRT did not provide this information.

12. Machine-scoring option. Eighteen tests met the criterion
    for machine scoring. Nine CRTs were hand-scorable only,
    and one CRT did not provide this information.

13. Score-interpretation scheme. Twenty-seven tests met the
    criterion by using some type of criterion-referenced
    score-interpretation scheme. Overwhelmingly, the scheme
    was expressed as an arbitrary mastery/nonmastery score or
    the number of items correct on a given objective. Of
    these same 27 tests, 7 also employed norm-referenced
    interpretations. One test did not describe its
    score-interpretation scheme.

14. Reusable materials. Twenty-four tests were designed so
    that at least some portion of the materials could be
    reused. These usually were the test booklets, when
    separate answer sheets were provided, and the teacher's
    and examiner's manuals. Three CRTs had no reusable
    materials, and one did not provide this information.

15. Curriculum match. Twenty-two tests appeared to have no
    match to a particular curriculum or instructional
    program. Six other tests also appeared to be rather
    general, although they claimed to be based in varying
    degrees on a review of what is currently being taught in
    today's schools.

16. Cost of tests per pupil. Based on a purchase of all tests
    in reading or math at the third grade level, costs, for a
    minimum purchase, ranged from about five cents per
    student to $6.31 per student. One test had to be
    implemented at the district level and cost $7,500. Most
    tests are sold in minimum sets of 30-35 test booklets.

17. Formal field tests. Eight tests provided documentation
    concerning field test activities. However, the
    information provided was remarkably sparse with several
    exceptions. Those who did conduct field tests usually
    attempted to get some sort of geographic and national
    representation. Fifteen tests claimed to have been field
    tested, but provided no supporting documentation. Five
    additional tests provided no information at all about
    field tests.

18. Information on item quality. Twelve test publishers
    reported having conducted item quality studies based on
    response data and/or expert review. Of these, attention
    typically was paid to item-objective congruence, item
    stability or equivalence, and sensitivity to instruction.
    Eight tests reported having some type of review but
    declined to state the kinds or extent of their studies.
    Eight other systems did not provide any information at
    all.

19. Information on test quality. Thirteen tests reported
    having conducted test quality studies based on response
    data and/or expert review. Of these, internal
    consistency, stability, test-objective congruence,
    sensitivity to instruction, and criterion validity (Step
    1) were most frequently attended to. Seven other systems
    claimed to have performed test quality studies, but
    provided no supporting documentation. Eight additional
    systems provided no information at all.
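
The counts reported above can be turned into percentages of the 28 tests to compare how often each criterion was met. This is an illustrative tally only: the counts are copied from the results for a selection of criteria, while the dictionary keys and the function name are labels introduced here.

```python
TOTAL_TESTS = 28

# Number of tests meeting selected criteria, as reported in the results.
met = {
    "grade-level coverage": 9,
    "directions for administration": 27,
    "time for testing": 2,
    "group testing": 25,
    "machine scoring": 18,
    "documented field test": 8,
}


def percent_met(counts, total=TOTAL_TESTS):
    """Map each criterion to the percentage of tests meeting it."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = percent_met(met)

# Criteria ordered from least to most often met.
ranked = sorted(shares, key=shares.get)
for name in ranked:
    print(f"{name}: {shares[name]}%")
```

Ordering the criteria this way makes it easy to see that time for testing and documented field tests were the requirements least often satisfied among those tallied.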



Practical Value of CRTs for Evaluations

Based on practical considerations alone, are CRTs appropriate
for large-scale effectiveness evaluations?

The answer is no. From the review, it is clear that no CRT
fully met all the criteria. Further, the review uncovered a
number of serious practical problems that diminish the
suitability of currently available CRTs for an effectiveness
evaluation:

Many learning objectives. Most of the CRTs reviewed had a
large number of very specific learning objectives that were
associated with very small units of instruction, like one to
five class lessons. The reason for the use of many narrowly
defined objectives can probably be traced to the original use
of CRTs by teachers as an aid to individualizing and
evaluating instruction. Nevertheless, an effectiveness
evaluation of the impact of just one year of instruction at
one grade level would generate information about an enormous
number of objectives, thus complicating the management,
analysis, and reporting of data.

Numerous test forms. Many currently available CRTs provide
separate test forms for each grade level that measure just
one or a few different objectives. The appearance of many
test forms also probably reflects the original intention to
use CRTs as classroom aids. In terms of an effectiveness
evaluation, the logistics of administering a number of
distinct tests complicates information-collection activities
and increases the chances of making errors and the costs of
conducting the evaluation.

Maximum time required for testing. Most available CRTs take
more than an hour of class time, which is the maximum time
that can usually be devoted. It should be noted that some of
the test publishers, recognizing time constraints, offered
CRTs that had just one item per objective. However, this is
not a satisfactory solution since reducing the number of
items will almost invariably bring with it a diminution in
the test's ability to measure with precision each of the
objectives although it may have the beneficial effect of
diminishing testing time.
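
The loss of precision with fewer items can be made concrete under a simple binomial model. That model is an assumption introduced here for illustration and is not a method described in the report: if a student answers each item on an objective correctly with probability p, the standard error of the observed proportion correct on n items is sqrt(p(1 - p)/n).

```python
import math


def se_proportion(p, n_items):
    """Standard error of proportion correct under a binomial model."""
    if n_items < 1 or not 0 <= p <= 1:
        raise ValueError("need n_items >= 1 and 0 <= p <= 1")
    return math.sqrt(p * (1 - p) / n_items)


# Hypothetical student with a 70% chance of answering any item correctly.
for n in (1, 5, 10):
    print(n, round(se_proportion(0.7, n), 3))
```

Under these assumptions a single item per objective leaves the estimate of a student's standing far less precise than five or ten items would, which is the trade-off against testing time noted above.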

TABLE 1

TESTS REVIEWED

Name of System                                        Publisher

Fountain Valley Teacher Support System-Reading        Richard Zweig Associates, Inc.
Fountain Valley Teacher Support System-Mathematics    Richard Zweig Associates, Inc.
Prescriptive Reading Inventory                        CTB/McGraw-Hill
Diagnostic Mathematics Inventory                      CTB/McGraw-Hill
Comprehensive Tests of Basic Skills                   CTB/McGraw-Hill
  Form S (CTBS/S)-Reading
Comprehensive Tests of Basic Skills                   CTB/McGraw-Hill
  Form S (CTBS/S)-Mathematics
ORBIT (Objectives-Referenced Bank of                  CTB/McGraw-Hill
  Items and Tests)
Skills Monitoring System-Reading*                     Harcourt, Brace, Jovanovich, Inc.
1973 Stanford Reading Tests                           Harcourt, Brace, Jovanovich, Inc.
1973 Stanford Mathematics Tests                       Harcourt, Brace, Jovanovich, Inc.
Individualized Criterion-Referenced                   Educational Development Corp.
  Testing-Reading
Individualized Criterion-Referenced                   Educational Development Corp.
  Testing-Mathematics
Woodcock Reading Mastery Tests, Form A                American Guidance Service
Key Math (Diagnostic Arithmetic Test)                 American Guidance Service
Mastery: An Evaluation Tool, SOBAR, Reading           Science Research Associates
Mastery: An Evaluation Tool-Mathematics               Science Research Associates
Individual Pupil Monitoring Systems-Reading           Houghton-Mifflin
Individual Pupil Monitoring Systems-Mathematics       Houghton-Mifflin
Comprehensive Achievement Monitoring                  National Evaluation Systems
  (CAM) Maintenance Pkg.-Reading
Comprehensive Achievement Monitoring                  National Evaluation Systems
  (CAM) Maintenance Pkg.-Mathematics
Objectives-Based Test Sets-Reading                    Instructional Objectives Exchange
Objectives-Based Test Sets-Mathematics                Instructional Objectives Exchange
Reading-Analysis of Skills                            Scholastic Testing Service
Mathematics-Analysis of Skills                        Scholastic Testing Service
Tests of Achievement in Basic Skills                  Educational and Industrial Testing Service
  (TABS)-Reading
Tests of Achievement in Basic Skills                  Educational and Industrial Testing Service
  (TABS)-Mathematics
Reading Inventory Probe I                             American Testing Company
Mathematics Inventory Tests                           American Testing Company

*This test was not available at press time.

Discrepancies between CRT's and program's objectives. Using
CRTs in effectiveness evaluations that involve more than one
educational program means determining relationships between
the general objectives the CRTs are designed to measure and
those of the programs so that achievement can be measured in
terms of the objectives emphasized in instruction and
exemplary programs can be identified. However, obtaining this
information is costly and complicated. Teachers can be asked,
for example, to rate the CRTs' objectives in terms of their
relevance to classroom instruction, but teacher ratings can
be unreliable. Instructional experts can be asked to analyze
textbooks and curriculum guides; however, they cannot know
for certain how these materials are being used in the
classroom.
A related problem concerns which objectives to test. Each
student or classroom can be tested on just those objectives
derived from the curriculum being used or on a sample of
objectives some of which may be relevant to the curriculum
while the others are not. Depending upon the choice, the
resulting evaluation information can be limited in its
usefulness for making comparisons or it may require
considerable manipulation before interpretations can be made.

Identifying common objectives across grade levels. The same
objectives are not always measured at all grade levels, or,
if they are, there is no system for identifying common
objectives. The skills and content associated with an
objective generally become more complex at higher grade
levels. To make comparisons over time or across grades,
however, it is necessary to identify skills or objectives
that are related in terms of a conceptual framework or
general content area. For example, in the fourth grade, a
punctuation objective might focus on beginning sentences with
capital letters and ending them with periods, while in the
ninth grade, a punctuation objective might focus on the
proper use of semicolons as alternatives to periods. Unless
the test publisher has identified the relationship between
these two objectives (for example, that they both have to do
with the same skill area), the evaluator may be forced to
decide this on his own, an instructional decision that is not
ordinarily part of the evaluator's expertise.

Unvalidated CRTs. The procedures used to validate CRTs are
not very sophisticated, and field test results are not
reported in any detail. When compared with the highly
structured field tests conducted for norm-referenced tests,
most CRTs are deficient with respect to the sample's size and
representativeness, and/or the amount and precision of data
presented in technical reports.

Insufficient score information. Most CRTs report scores
either as the number of items correctly answered for each
objective or sometimes as mastery or nonmastery scores,
"mastery" meaning correctly answering an arbitrarily selected
number of items per objective. These types of score
interpretations are accepted by theorists as legitimate ways
of expressing CRT test scores and they may have meaning for
teachers who know their curriculum. However, for
effectiveness-evaluation purposes, these types of
interpretations are inadequate because they provide
insufficient information for decision making and lose meaning
outside the classroom.
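
The two kinds of score report described above can be sketched as follows. Everything specific in the example is hypothetical: the objective names, the eight-item test, the responses, and the cutoff of 3 correct out of 4 are invented, reflecting only the report's point that mastery cutoffs are arbitrarily selected.

```python
def score_by_objective(item_scores, item_objective, cutoff):
    """Return {objective: (number_correct, mastered)}.

    item_scores: {item_id: 0 or 1}
    item_objective: {item_id: objective name}
    cutoff: number correct required for "mastery" (an arbitrary choice)
    """
    correct = {}
    for item, objective in item_objective.items():
        correct[objective] = (correct.get(objective, 0)
                              + item_scores.get(item, 0))
    return {obj: (n, n >= cutoff) for obj, n in correct.items()}


# Hypothetical 8-item test: items 1-4 and 5-8 measure two objectives.
item_objective = {
    i: ("capitalization" if i <= 4 else "end punctuation")
    for i in range(1, 9)
}
item_scores = {1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0, 8: 1}

report = score_by_objective(item_scores, item_objective, cutoff=3)
print(report)
```

Changing the cutoff from 3 to 2 would reclassify the second objective as mastered with no change in the student's responses, which illustrates why such scores carry limited meaning outside the classroom.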

Financial considerations. A final practical problem with
using currently available CRTs for effectiveness-evaluation
purposes is that most are costly. This probably reflects the
effort it takes to define domains, to develop the special
features offered by CRTs, such as referencing the objectives
to various school curriculums, and to provide many short test
forms that can be used efficiently for classroom instruction
purposes.



CONCLUSIONS 



There is no currently available CRT that is feasible for use
in large-scale effectiveness evaluations. This conclusion is
based on practical, not theoretical, considerations. One
major reason for the likely inappropriateness of available
CRTs is that many of them have been designed for classroom,
but not evaluation, purposes; consequently, they are
characterized by numerous, narrowly defined objectives, each
measured on a separate test form. In the context of an
effectiveness evaluation, these CRTs produce unwieldy amounts
of information, require too much time for testing, and create
logistical problems for test administrators.

A second major practical failing of currently available CRTs
is that field tests are either not documented or are
performed inadequately. As a result, the reliability and
validity of these CRTs is simply not known, and it is
inappropriate to provide decision makers with information of
unconfirmed quality.

A third major failing of available CRTs is that the score
interpretations given are not as meaningful as can be
expected. Most are presented as numbers of items passed,
without Step 2 criterion validity information or comparative
data as supplements. Two additional practical failings are
the CRTs' costs and the absence of mechanisms for tracking
the same skills or objectives across grade levels.

A CRT that is appropriate to use in measuring achievement in
an effectiveness evaluation should be based on a limited set
of objectives that represent essential competencies and basic
skills, should be proven reliable and valid, and should be
able to provide scores that are meaningful and useful.



REFERENCES*

1. Baker, R.L. Measurement considerations in instructional
   product development. Paper presented at Conference on
   Problems in Objective Based Measurement, Center for the
   Study of Evaluation, UCLA Graduate School of Education,
   Univer. of California, Los Angeles, 1972.

2. Bormuth, J.R. On the theory of achievement test items.
   Chicago: Univer. of Chicago Press, 1970.

3. Buros, O.K. (Ed.) The mental measurements yearbook.
   Highland Park, Harper, 1970.

4. Cronbach, L.J. Essentials of psychological testing (3rd
   ed.). New York: Harper, 1970.

5. Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.),
   Educational measurement (2nd ed.). Washington, D.C.:
   American Council on Education, 1971.

6. Cronbach, L.J., & Suppes, P. (Eds.) Disciplined inquiry
   for education. Stanford, Calif.: National Academy of
   Education, 1969.

7. Davis, F.B., & Diamond, J.J. The preparation of
   criterion-referenced tests. CSE Monograph No. 3. Los
   Angeles: UCLA Graduate School of Education, Center for the
   Study of Evaluation, Univer. of California, Los Angeles,
   1974.

8. Ebel, R.L. Evaluation and educational objectives:
   behavioral and otherwise. Paper presented at the
   Convention of the American Psychology Association,
   Honolulu, Hawaii, 1972.

9. Fink, A., & Kosecoff, J. An evaluation primer. Book in
   preparation, 1976.

10. Glaser, R. Instructional technology and the measurement
    of learning outcomes: some questions. American
    Psychologist, 1963, 18, 519-521.

11. Glaser, R., & Nitko, A. Measurement in learning and
    instruction. In R.L. Thorndike (Ed.), Educational
    measurement (2nd ed.). Washington, D.C.: American Council
    on Education, 1971. Pp. 652-670.

12. Harris, C. Comments on problems of objective-based
    measurement. Paper presented at the annual AERA meeting,
    New Orleans, 1973.

13. Harris, M.L., & Stewart, D.M. Applications of classical
    strategies to criterion-referenced test construction. A
    paper presented at the annual meeting of the American
    Educational Research Association, New York, 1971.

14. Hively, W. Introduction to domain referenced achievement
    testing. Symposium presentation, AERA, Minnesota, 1970.

15. Hively, W., Maxwell, G., Rabehl, G., Sension, D., &
    Lundin, S. Domain referenced curriculum evaluation: A
    technical handbook and case study from the MINNEMAST
    project. CSE Monograph No. 1. Los Angeles: UCLA Graduate
    School of Education, Center for the Study of Evaluation,
    Univer. of California, Los Angeles, 1973.

16. Hoepfner, R. et al. CSE elementary school test
    evaluations. Los Angeles: UCLA Graduate School of
    Education, Center for the Study of Evaluation, Univer. of
    California, Los Angeles, 1971.

17. Hoepfner, R. CSE secondary school test evaluations:
    Grades 7 & 8. Los Angeles: UCLA Graduate School of
    Education, Center for the Study of Evaluation, Univer. of
    California, Los Angeles, 1974.

18. Hoepfner, R. et al. CSE-ECRC preschool/kindergarten test
    evaluations. Los Angeles: UCLA Graduate School of
    Education, Center for the Study of Evaluation, Univer. of
    California, Los Angeles, 1971.

19. Hofstadter, R. Anti-intellectualism in American life. New
    York: Vintage Books, 1963.

20. Keesling, J.W. Identification of differing intended
    outcomes and their implications for evaluation. Paper
    presented at the annual meeting of the American
    Educational Research Association, Washington, D.C., 1975.

21. Klein, S.P. Evaluating tests in terms of the information
    they provide. Evaluation Comment, 1970, 2(2), 1-6.
    ED 045 699.

22. Klein, S.P. An evaluation of New Mexico's educational
    priorities. Paper presented at Western Psychological
    Association, Portland, 1972. TM 002 735 (ED number not
    yet available).

23. Klein, S.P., Fenstermacher, G., & Alkin, M. The center's
    changing evaluation model. Evaluation Comment, 1971,
    2(4).

24. Klein, S.P., & Kosecoff, J.B. Issues and procedures in
    the development of criterion-referenced tests. ERIC/TM
    Report 26. Princeton, N.J.: ERIC Clearinghouse on Tests,
    Measurement, and Evaluation, 1973.

25. Mager, R.F. Preparing instructional objectives. San
    Francisco: Fearon, 1962.

26. Novick, M.R., & Lewis, C. Prescribing test length for
    criterion-referenced measurement. CSE Monograph No. 3.
    Los Angeles: UCLA Graduate School of Education, Center
    for the Study of Evaluation, Univer. of California, Los
    Angeles, 1974.

27. Popham, W.J. Educational evaluation. Englewood Cliffs,
    N.J.: Prentice-Hall, 1975.

28. Popham, W.J., & Husek, T.R. Implications of
    criterion-referenced measurement. Journal of Educational
    Measurement, 1969, 6(1), 1-9.

29. Skager, R. Generating criterion-referenced tests from
    objectives based assessment systems: Unsolved problems in
    test development, assembly and interpretation. Paper
    presented at the annual AERA meeting, New Orleans, 1973.

30. Skager, R. Critical differentiating characteristics for
    tests of educational achievement. Paper presented at the
    annual AERA meeting, Washington, 1976.

31. Wilson, H.A. A humanistic approach to criterion-referenced
    testing. Paper presented at the annual AERA meeting, New
    Orleans, 1973.

32. Wilson, H.A. A judgmental approach to criterion-referenced
    testing. CSE Monograph No. 3. Los Angeles: UCLA Graduate
    School of Education, Center for the Study of Evaluation,
    University of California, Los Angeles, 1974.

33. Zweig, R., & Associates. Personal communication, March
    15, 1973.

*Items followed by an ED number (for example ED 045 699) are
available from the ERIC Document Reproduction Service (EDRS).
Consult the most recent issue of Resources in Education for
the address and ordering information.



